Performance of Multicore LUP Decomposition
Authors
Abstract
This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations vary widely: two are written in Cilk++ but use different underlying algorithms to perform LUP decomposition, one is written in Fortran, and one is written in C++ with pthreads. This diversity allows us to evaluate which implementation techniques and properties are important for achieving high performance on modern multicore systems. Our evaluation indicates that the performance of parallelized LUP decomposition depends on efficient use of on-chip and off-chip memory resources. Furthermore, dynamic scheduling is an important factor in achieving good performance in multiprogrammed environments.
Similar resources
Design of a novel congestion-aware communication mechanism for wireless NoC architecture in multicore systems
Hybrid Wireless Network-on-Chip (WNoC) architecture has emerged as a scalable communication structure to mitigate the deficits of traditional NoC architecture for future multicore systems. The hybrid WNoC architecture provides energy-efficient, high-data-rate, and flexible communications for NoC architectures. In these architectures, each wireless router is shared by a set of processing core...
Efficient Wavelet Tree Construction and Querying for Multicore Architectures
Wavelet trees have become very useful to handle large data sequences efficiently. By the same token, in the last decade, multicore architectures have become ubiquitous, and parallelism in general has become extremely important in order to gain performance. This paper introduces two practical multicore algorithms for wavelet tree construction that run in O(n) time using lg σ processors, where n ...
Scheduling dense linear algebra operations on multicore processors
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to their inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workl...
Scheduling Linear Algebra Operations on Multicore Processors
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to their inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloa...
Multifrontal multithreaded rank-revealing sparse QR factorization
SuiteSparseQR is a sparse multifrontal QR factorization algorithm. Dense matrix methods within each frontal matrix enable the method to obtain high performance on multicore architectures. Parallelism across different frontal matrices is handled with Intel’s Threading Building Blocks library. Rank-detection is performed within each frontal matrix using Heath’s method, which does not require colu...